Abstract

We present a new network model accounting for multidimensional assortativity.
Each node is characterized by a number of features and the probability of
a link between two nodes depends on common features. We do not fix a priori 
the total number of possible features. The bipartite network of the nodes 
and the features evolves according to a stochastic dynamics that depends on 
three parameters that respectively regulate the preferential attachment in the 
transmission of the features to the nodes, the number of new features per 
node, and the power-law behavior of the total number of observed features. 
Our model also takes into account a mechanism of triadic closure. We provide 
theoretical results and statistical estimators for the parameters of the model. 
We validate our approach by means of simulations and an empirical analysis 
of a network of scientific collaborations. 

Keywords: complex network, bipartite network, assortativity, homophily,
preferential attachment, triadic closure.


1 Introduction 

Many complex systems are described by means of a network of interacting
components, i.e. a set of nodes connected by links [6, 15, ...]. A large number
of scientific fields involve the study of networks in some form: networks have been
used to analyze interpersonal social relationships, communication systems,
international trade, financial systems, co-authorships and citations, protein interaction
patterns, and much more. Therefore, formal stochastic models and statistical
techniques for the analysis of network data have emerged as a major topic of interest
in diverse areas of study. The distribution of the number of a node's connections is
well approximated by a power law in many contexts, and preferential attachment
is generally accepted as the simplest mechanism that can reproduce such a
distribution [...]. This basic mechanism, however, is only one of the many forces that
can contribute to shaping the evolution of complex networks. For instance, a
social network having a power-law degree distribution is an exception rather than the
rule. In particular, preferential attachment is not able to reproduce the formation
of social groups, or communities, and the composition of social circles.
Assortativity (or assortative mixing), called homophily in social networks, is defined as the
prevalence of network links between nodes that are similar to each other in some
respect. Network theorists often analyze assortativity in terms of a node's degree
[..., 52, 53]. Moreover, a large body of research in sociology and, more recently,
in economics, confirms the presence of a multidimensional assortativity in
socio-economic networks: homophily, along the lines of race and ethnicity, age and sex,
education, professional background and occupation, shapes complex networks such
as friendship, marriage, teamwork, co-membership, exchange and communication
networks [3, 16, 19, 23, 29, 32, ...]. The assortativity property has also been
studied in citation networks: for instance, in [13] the authors analyze the citations
among papers (the nodes of the network) published in journals of the American
Physical Society with respect to their PACS classification codes, which represent the
different research sub-fields. In formal models assortativity is typically represented
by partitioning nodes into different classes (also called groups, clusters, or types)
related to some (observable or unobservable) features [20, 37, ...]. The assumption
that each node can belong only to a single class, and/or the fact that the number of
classes (as well as the number of possible features) is finite and fixed a priori,
restricts their applicability.

We contribute to this growing body of literature by introducing a new stochastic
model accounting for multidimensional assortativity. The study of networks of
papers, such as co-authorship or citation networks [...], is a particularly
suitable application of our model, as the generative processes of features and links
are consistent with the basic aspects of the model: first, it is a growing network
process where nodes appear in chronological order and do not exit; second, the links
are established at the entrance of the nodes and do not change over time; third,
each node exhibits some features (for example, keywords, main topics, etc.) that
do not change over time; finally, the set of features grows in time, and the
evolution of the node-feature structure is as interesting as the process of
link creation among the nodes. Indeed, the description of both phenomena is very
important for understanding the diffusion of ideas and discoveries inside
a certain research field and among different research fields. In any case, as we will
discuss at the end of this paper, our model can be easily modified and/or enriched
in order to obtain variants that better fit networks of a different type.

In particular, besides the link-creation mechanism, our model provides a stochastic
dynamics for the evolution of the features. Differently from the works quoted above
(see, for instance, the model in [...] and the related discussion about the
selection problem for the dimension of the feature space), we do not fix a priori the
total number of possible features but we allow the number of observed features to
grow in time. The bipartite node-feature network (i.e. the surrounding context)
grows according to a stochastic model that depends on three parameters that
respectively regulate the preferential attachment in the transmission of the features to
the nodes, the number of new features per node, and the power-law behavior of the
total number of observed features. Concerning this point, the present paper may be
considered a companion article to [12]. Indeed, both provide an evolving
dynamics for the feature structure, but they also show some differences. The main
one is that here we introduce a parameter that tunes the preferential attachment
in the transmission of the features to the nodes, while in [12] the authors only consider a
preferential attachment rule. Moreover, in that paper a random “fitness” parameter,
which determines the node's ability to transmit its own features to other nodes (see
also [...]), is attached to each node, while here we do not take into account fitness
parameters for nodes.

Coming from a structural approach, differently from other models which
concentrate only on assortativity [...], our model also accounts for the principle,
known as triadic closure or transitivity, according to which, if A is a neighbor of B
and B is a neighbor of C, then A and C have a high chance of being neighbors. This
principle is widely supported empirically and it is the basis of many
generative network models [10, 22, 28, ...]. It
is worth noting that the expression “triadic closure” conceptually refers to a
link-formation process that does not depend on the features of the nodes that get attached.
However, assortativity can also naturally induce closed triplets in the network, and
hence evaluating assortativity and triadic closure separately may sometimes not be
easy. (For a further discussion on this issue, we refer to Section 5.) In any case,
models based on both mechanisms produce more realistic networks.

The paper is structured as follows. In Section 2 we describe the basic assumptions
of our model and the notation used throughout the paper. In Section 3 we present
our stochastic model, which involves a dynamics for the bipartite node-feature
network and the mechanism underlying the formation of the unipartite (i.e.
node-node) network. In Section 4 we illustrate some theoretical results and we carefully
explain the meaning of each parameter of our model. In Section 5 we show
and discuss some statistical tools for estimating the model parameters from
the data. In Section 6 we provide a number of simulations in order to illustrate
the role of the model parameters and the performance of the proposed estimation
tools. Section 7 deals with an application of our model and instruments to a
co-authorship network. Finally, in the last section we give our conclusions and discuss some
future developments. The paper is complemented by an Appendix that contains a theorem
and its proof, and supplementary simulation results.

2 Preliminaries 

We assume new nodes sequentially join the network so that node $i$ represents the
one that comes into the network at time step $i$. Each node shows a finite number of
features, which can be of different kinds (keywords, main topics, spatial/geographical
contexts, profile, etc.), and different nodes can share the same features. It is worth
noting that we do not specify a priori the total number of possible features
but we allow the number of observed features to increase over time. On its arrival,
each new node links to some nodes already present in the system. First, links are
created according to probabilities that depend on the number of common features
(multidimensional assortativity). Then additional links can be established by means
of common neighbors, inducing the closure of some triangles (triadic closure). We
consider the connections as undirected and non-breakable, and we omit self-loops (i.e.
edges of type $\{i,i\}$). In particular, this means that connections are mutual or the
direction is naturally predefined (for instance, only citations from newer to older
nodes are possible). We denote the adjacency matrix (symmetric by assumption)
by $A$, so that $A_{i,j} = 1$ when there exists a link between nodes $i$ and $j$, $A_{i,j} = 0$
otherwise. We set

$$V_j(i) = \{\, j' = 1, \dots, i : A_{j,j'} = 1 \,\}$$

to be the set of node $j$'s neighbors at time step $i$ (after the arrival of $i$).

We denote by $F$ the binary bipartite network where each row $F_i$ represents the
features of node $i$: $F_{i,k} = 1$ if node $i$ has feature $k$, $F_{i,k} = 0$ otherwise. It represents
the surrounding context in which the nodes interact. We assume that each $F_i$
does not change over time. We take $F$ left-ordered: this means that in the first
row the columns for which $F_{1,k} = 1$ are grouped on the left and hence, if the first
node has $N_1$ features, then the columns of $F$ with index $k \in \{1, \dots, N_1\}$ represent
these features. The second node could have some features in common with the first
node (those corresponding to indices $k$ such that $k = 1, \dots, N_1$ and $F_{2,k} = 1$) and
some, say $N_2$, new features. The latter are grouped on the right of the set for which
$F_{1,k} = 1$, i.e., the columns of $F$ with index $k \in \{N_1 + 1, \dots, N_1 + N_2\}$ represent the new
features brought by the second node. This grouping structure persists throughout
the matrix $F$ and we define

$$L_n = \sum_{i=1}^{n} N_i, \tag{2.1}$$

i.e. $L_n$ is the overall number of different observed features for the first $n$ nodes.

Here is an example of an $F$ matrix with $n = 3$ nodes: [matrix not reproduced here]

In gray we show the new features brought by each node (in the example $N_1 = 3$,
$N_2 = 2$, $N_3 = 3$ and so $L_1 = 3$, $L_2 = 5$, $L_3 = 8$). Observe that, for every node
$i$, the $i$-th row contains 1 for all the columns with indices $k \in \{L_{i-1} + 1, \dots, L_i\}$
(they represent the new features brought by $i$). Moreover, some elements of the
columns with indices $k \in \{1, \dots, L_{i-1}\}$ are also 1 (features brought by previous
nodes and adopted by node $i$).
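The left-ordered layout can be illustrated with a small sketch. The matrix below is hypothetical: it is not the example from the paper, only one matrix chosen to be consistent with $N_1 = 3$, $N_2 = 2$, $N_3 = 3$. The helper names `new_feature_counts` and `cumulative` are likewise illustrative.

```python
# Hypothetical 3-node feature matrix, chosen only to match N = (3, 2, 3).
F = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # node 1 brings features 1-3
    [1, 0, 1, 1, 1, 0, 0, 0],  # node 2 adopts features 1 and 3, brings 4-5
    [0, 1, 1, 0, 1, 1, 1, 1],  # node 3 adopts features 2, 3, 5, brings 6-8
]


def new_feature_counts(F):
    """Return N_i: the number of columns whose first 1 appears in row i."""
    counts = [0] * len(F)
    for k in range(len(F[0])):
        first = next((i for i, row in enumerate(F) if row[k] == 1), None)
        if first is not None:
            counts[first] += 1
    return counts


def cumulative(counts):
    """Return L_n = N_1 + ... + N_n for every n."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out


N = new_feature_counts(F)
L = cumulative(N)
print(N, L)
```

In a left-ordered matrix the new features of node $i$ occupy exactly the columns $L_{i-1}+1, \dots, L_i$, which is why counting first appearances per row recovers the $N_i$.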

3 The model 

Fix $\alpha > 0$, $\beta \in [0,1]$, $\delta \in [0,1]$, $p \in [0,1]$ and let $\Phi : \mathbb{R} \to [0,1]$ be an increasing
function. The dynamics is the following. Node 1 arrives and shows $N_1$ features,
where $N_1$ is $\mathrm{Poi}(\alpha)$-distributed (the symbol $\mathrm{Poi}(\alpha)$ denotes the Poisson distribution
with mean $\alpha$). Then, for each $i \geq 2$,

• Feature-structure dynamics: Node $i$ arrives and shows a number of
features as follows:

- Node $i$ exhibits some of the “old” features brought by the previous nodes
$1, \dots, i-1$: more precisely, each feature $k \in \{1, \dots, L_{i-1}\}$ is,
independently of the others, possessed by node $i$ with probability (that we call
“inclusion-probability”)

$$P_i(k) = \delta\,\frac{1}{2} + (1-\delta)\,\frac{\sum_{j=1}^{i-1} F_{j,k}}{i}, \tag{3.1}$$

where $F_{j,k} = 1$ if node $j$ shows feature $k$ and $F_{j,k} = 0$ otherwise.

- Node $i$ also shows $N_i$ “new” features, where $N_i$ is $\mathrm{Poi}(\lambda_i)$-distributed with

$$\lambda_i = \frac{\alpha}{i^{1-\beta}}. \tag{3.2}$$

($N_i$ is independent of $N_1, \dots, N_{i-1}$ and of the exhibited “old” features.)

The matrix element $F_{i,k}$ is set equal to 1 if node $i$ has feature $k$ and equal to
zero otherwise.

• Network construction: On its arrival, node $i$ determines a set $\mathcal{L}_i$ of
neighbors among the nodes already present in the network (so that we set $A_{i,j} =
A_{j,i} = 1$ for each $j \in \mathcal{L}_i$) as follows:

- (First phase) First, a set $\mathcal{L}'_i$ of neighbors of node $i$ is established on the
basis of the features shown. Each node $j$ already present in the network
(i.e. $1 \leq j \leq i-1$) is included in $\mathcal{L}'_i$, independently of the others, with
probability $\Phi(S_{i,j})$, where

$$S_{i,j} = \sum_{k=1}^{L_j} F_{i,k} F_{j,k} \tag{3.3}$$

is the number of features that $i$ and $j$ have in common.

- (Second phase) Then some extra neighbors are added to $\mathcal{L}_i$ on the basis
of common neighbors. For every node $j \in \{1, \dots, i-1\} \setminus \mathcal{L}'_i$, each node
$j' \in V_j(i-1) \cap \mathcal{L}'_i$ (i.e. each neighbor that $i$ and $j$ currently share)
can induce, independently of the others, the additional link $\{i,j\}$ with
probability $p$.
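The feature-structure dynamics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the inclusion probability is taken as $\delta/2 + (1-\delta)\,m_k/i$, where $m_k$ is the number of earlier nodes showing feature $k$, and the Poisson mean as $\alpha / i^{1-\beta}$. The names `poisson` and `simulate_features` are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Sample Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_features(n, alpha, beta, delta, seed=0):
    """Return (rows, L): rows[i-1] is the feature set of node i, L[i-1] is L_i."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of nodes showing feature k so far
    rows, L = [], []
    for i in range(1, n + 1):
        row = set()
        # Old features: assumed inclusion probability delta/2 + (1 - delta) * m_k / i.
        for k, m in enumerate(counts):
            if rng.random() < delta / 2 + (1 - delta) * m / i:
                row.add(k)
        # New features: Poisson with assumed mean alpha / i^(1 - beta).
        n_new = poisson(alpha / i ** (1 - beta), rng)
        first_new = len(counts)
        row.update(range(first_new, first_new + n_new))
        counts.extend([0] * n_new)
        for k in row:
            counts[k] += 1
        rows.append(row)
        L.append(len(counts))
    return rows, L


rows, L = simulate_features(1000, alpha=3.0, beta=0.5, delta=0.1, seed=1)
print("total features:", L[-1])
```

Features are indexed in order of appearance, so the resulting matrix is left-ordered by construction. Knuth's sampler is used only to keep the sketch free of external dependencies.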


4 Meaning of the model parameters and some results

We now illustrate the meaning of the model parameters and some mathematical 
results regarding our model. 


4.1 The parameters $\alpha$ and $\beta$

Let us start with $\alpha$ and $\beta$. The main effect of $\beta$ is to regulate the asymptotic behavior
of the random variable $L_n$ defined in (2.1) as a function of $n$. In particular, $\beta > 0$ is
the power-law exponent of $L_n$. The main effect of $\alpha$ is the following: the larger $\alpha$, the
larger the total number of new features brought by a node. It is worth noting that
$\beta$ fits the asymptotic behavior of $L_n$ and then, separately, $\alpha$ fits the number of new
observed features per node. (In Section 6.1 we will discuss this fact in more depth.)

More precisely, we prove (see the Appendix) the following asymptotic behaviors:

a) for $\beta = 0$, we have a logarithmic behavior of $L_n$, that is $L_n/\ln(n) \to \alpha$ almost surely;

b) for $\beta \in (0,1]$, we obtain a power-law behavior, i.e. $L_n/n^{\beta} \to \alpha/\beta$ almost surely.
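A heuristic check of these two regimes, at the level of expectations only (the rigorous statement is the one proved in the Appendix), assuming the Poisson means of the model are $\lambda_i = \alpha / i^{1-\beta}$:

```latex
\mathbb{E}[L_n] \;=\; \sum_{i=1}^{n} \lambda_i \;=\; \alpha \sum_{i=1}^{n} i^{\beta-1}
\;\approx\;
\begin{cases}
\alpha \ln n, & \beta = 0,\\[1ex]
\alpha \displaystyle\int_{0}^{n} x^{\beta-1}\,dx \;=\; \dfrac{\alpha}{\beta}\, n^{\beta}, & \beta \in (0,1].
\end{cases}
```

For instance, with $\alpha = 3$, $\beta = 0.5$ and $n = 1000$ this gives $\mathbb{E}[L_n] \approx 6\sqrt{1000} \approx 190$ features.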


4.2 The parameter $\delta$

The parameter $\delta$ tunes the phenomenon of preferential attachment in the spreading
process of features among nodes. The value $\delta = 0$ corresponds to the “pure
preferential attachment case”: the larger the weight of a feature $k$ at time step $i-1$ (given
by the numerator of the second term in (3.1), i.e., the total number of nodes that
exhibit it up to time step $i-1$), the greater the probability that $k$ will be shown
by the future node $i$. The value $\delta = 1$ corresponds to the “pure i.i.d. case” with
inclusion probability equal to 1/2: a node includes each feature with probability 1/2
independently of the other nodes and the other features. When $\delta \in (0,1)$, we have a
mixture of the two cases above: the smaller $\delta$, the more significant the role played
by preferential attachment in the transmission of the features to new nodes.


4.3 The function $\Phi$ and the parameter $p$

According to our model, when a new node enters the system, it links to some
(possibly zero, one, or more) old nodes by means of the two-phase network construction
described in Section 3. In the first phase, a new node $i$ connects itself to some of the
old nodes according to a probability depending on its own features and those of
the others. The function $\Phi$ relates the “first-phase link-probability” of $i$ to $j$ (with
$1 \leq j \leq i-1$) to their “similarity” $S_{i,j}$ defined by (3.3). Since $\Phi$ is assumed to be
an increasing function, a higher number of common features between nodes $i$ and $j$
induces a larger probability for them to connect (in line with the principle of assortativity).
For instance, we can take the generalization of the logistic function, i.e. the sigmoid
function

$$\Phi(s) = \frac{1}{1 + e^{K(\vartheta - s)}}, \qquad K > 0,\ \vartheta \in \mathbb{R}. \tag{4.1}$$

The sigmoid function smoothly increases (from 0 to 1) around a threshold $\vartheta$, while
$K$ controls its smoothness: the bigger $K$, the steeper the sigmoid. In particular,
$K = 1$ and $\vartheta = 0$ give the logistic function and, for $K \to +\infty$, $\Phi$ approaches
a step function equal to 1 or 0, if the variable $s$ is respectively greater or smaller
than $\vartheta$ (in our model, in this limit and for $\vartheta > 0$, the links are established deterministically
based on whether or not the two involved nodes have a similarity bigger than $\vartheta$).
In the second phase, node $i$ can connect to some of the nodes discarded in the first
phase by means of common neighbors (triadic closure). The parameter $p$ regulates
this phenomenon. Indeed, it represents the probability that a node causes a link
between two of its neighbors. More precisely, in the second phase, the probability of
having a link between node $i$ and a node $j \in \{1, \dots, i-1\} \setminus \mathcal{L}'_i$ is $1 - (1-p)^{C_{i,j}}$,
where $C_{i,j} = \mathrm{card}(V_j(i-1) \cap \mathcal{L}'_i)$ is the number of common neighbors of $i$ and $j$
after the first phase. Consequently, the “second-phase link-probability” between a
pair of nodes increases with respect to $p$ and the number of neighbors they share.
The case $p = 0$ corresponds to the case in which the connections only depend on
the similarity among nodes. The case $p = 1$ corresponds to the case in which the
connection is automatically established when $C_{i,j} > 0$.
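The two-phase construction can be sketched as follows, assuming the sigmoid form $\Phi(s) = 1/(1 + e^{K(\vartheta - s)})$ for the first-phase link-probability and $1 - (1-p)^{C_{i,j}}$ for the second phase. The function names and the representation of features as Python sets are illustrative choices, not part of the paper.

```python
import math
import random


def sigmoid_phi(s, K, theta):
    """Assumed first-phase link probability: 1 / (1 + exp(K * (theta - s)))."""
    x = K * (theta - s)
    if x > 700:  # avoid overflow in exp; the probability is numerically zero here
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def build_network(rows, K, theta, p, seed=0):
    """rows[i] is the feature set of node i (0-based). Returns a list of neighbor sets."""
    rng = random.Random(seed)
    n = len(rows)
    neighbors = [set() for _ in range(n)]
    for i in range(1, n):
        # First phase: link to each older node j with probability Phi(S_ij),
        # where S_ij is the number of common features.
        first = {j for j in range(i)
                 if rng.random() < sigmoid_phi(len(rows[i] & rows[j]), K, theta)}
        # Second phase: each common neighbor closes the triangle with probability p,
        # so a discarded node j is linked with probability 1 - (1 - p)^C_ij.
        extra = set()
        for j in range(i):
            if j in first:
                continue
            c_ij = len(neighbors[j] & first)
            if c_ij > 0 and rng.random() < 1.0 - (1.0 - p) ** c_ij:
                extra.add(j)
        for j in first | extra:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


rows = [{0, 1}, {1}, {2}, {0, 2}]
print(build_network(rows, K=2.0, theta=1.0, p=0.5, seed=3))
```

The neighbor sets are updated only after both phases, so `neighbors[j]` plays the role of $V_j(i-1)$ while node $i$ is being processed.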


5 Estimation of the model parameters 

In this section we illustrate how to estimate the model parameters from the data.

Suppose we can observe the values of $F_1, \dots, F_n$, i.e. $n$ rows of the matrix $F$,
where $n$ is the number of observed nodes. From the asymptotic behavior of $L_n$, we
get that $\ln(L_n)/\ln(n)$ is a strongly consistent estimator for $\beta$, hence we can use the
slope $\hat\beta$ of the regression line in the log-log plot (of $L_n$ as a function of $n$) as an
estimate for $\beta$.

After computing $\hat\beta$, we can estimate $\alpha$ as:

$$\hat\alpha = \hat\gamma \ \text{ when } \hat\beta = 0, \qquad
\hat\alpha = \hat\beta\,\hat\gamma \ \text{ when } 0 < \hat\beta \leq 1, \tag{5.1}$$

where $\hat\gamma$ is the slope of the regression line in the plot $(\ln(n), L_n)$ or in the plot
$(n^{\hat\beta}, L_n)$ according to whether $\hat\beta = 0$ or $\hat\beta \in (0,1]$.
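A minimal sketch of these two regression steps, using only the sequence $L_1, \dots, L_n$. The text does not say whether the regression lines include an intercept; this sketch assumes ordinary least squares with an intercept, and it clips the slope to $[0,1]$. The names `ols_slope` and `estimate_beta_alpha` are hypothetical.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of y on x (regression line with intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def estimate_beta_alpha(L):
    """L[n-1] = number of distinct features observed among the first n nodes."""
    ns = [n for n in range(1, len(L) + 1) if L[n - 1] > 0]
    # beta: slope of the log-log plot of L_n against n, kept inside [0, 1].
    beta_hat = ols_slope([math.log(n) for n in ns], [math.log(L[n - 1]) for n in ns])
    beta_hat = min(max(beta_hat, 0.0), 1.0)
    if beta_hat == 0.0:
        # alpha: slope of L_n against ln(n).
        return beta_hat, ols_slope([math.log(n) for n in ns], [L[n - 1] for n in ns])
    # alpha: beta times the slope of L_n against n^beta.
    gamma_hat = ols_slope([n ** beta_hat for n in ns], [L[n - 1] for n in ns])
    return beta_hat, beta_hat * gamma_hat


# Noise-free check on L_n = (alpha / beta) * n^beta with alpha = 3, beta = 0.5.
L_exact = [6.0 * math.sqrt(n) for n in range(1, 501)]
print(estimate_beta_alpha(L_exact))
```

On simulated data the estimates fluctuate around the true values, so the noise-free check above only confirms that the two slopes are combined as in the estimator for $\alpha$.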

We can estimate $\delta$ by means of a maximum likelihood procedure. For this
purpose, we now give a general expression of the probability of observing $F_1 =
f_1, \dots, F_n = f_n$ given the parameters $\alpha$, $\beta$, and $\delta$.

The first row $F_1$ is simply identified by $L_1 = N_1$ and so

$$P(F_1 = f_1) = P(N_1 = n_1 = \mathrm{card}\{k : f_{1,k} = 1\})
= \mathrm{Poi}(\alpha)\{n_1\} = e^{-\alpha}\,\frac{\alpha^{n_1}}{n_1!}.$$

Then the second row is identified by the values $F_{2,k}$, with $k = 1, \dots, L_1 = N_1$, and
by $N_2$, so that

$$\begin{aligned}
P(F_2 = f_2 \mid F_1)
&= P\big(F_{2,k} = f_{2,k} \text{ for } k = 1, \dots, L_1,\ N_2 = n_2 = \mathrm{card}\{k > L_1 : f_{2,k} = 1\} \,\big|\, F_1\big) \\
&= \prod_{k=1}^{L_1} P_2(k)^{f_{2,k}} \big(1 - P_2(k)\big)^{1 - f_{2,k}} \times \mathrm{Poi}(\lambda_2)\{n_2\},
\end{aligned}$$

where $P_2(k)$ is defined in (3.1) and $\lambda_2$ is defined in (3.2). The general formula is

$$\begin{aligned}
P(F_i = f_i \mid F_1, \dots, F_{i-1})
&= P\big(F_{i,k} = f_{i,k} \text{ for } k = 1, \dots, L_{i-1},\ N_i = n_i = \mathrm{card}\{k > L_{i-1} : f_{i,k} = 1\} \,\big|\, F_1, \dots, F_{i-1}\big) \\
&= \prod_{k=1}^{L_{i-1}} P_i(k)^{f_{i,k}} \big(1 - P_i(k)\big)^{1 - f_{i,k}} \times \mathrm{Poi}(\lambda_i)\{n_i\},
\end{aligned}$$

where $P_i(k)$ is defined in (3.1) and $\lambda_i$ is defined in (3.2). Thus, for $n$ nodes, we can
write a formula for the probability of observing $F_1 = f_1, \dots, F_n = f_n$:

$$P(F_1 = f_1, \dots, F_n = f_n) = P(F_1 = f_1) \prod_{i=2}^{n} P(F_i = f_i \mid F_1, \dots, F_{i-1}).$$

Therefore, we look for the value of $\delta$ that maximizes the likelihood function, i.e. the quantity
$P(F_1 = f_1, \dots, F_n = f_n)$ as a function of $\delta$ (given the observed vectors $f_i$). Since
some factors do not depend on $\delta$, we can simplify the function to be maximized as

$$\prod_{i=2}^{n} \prod_{k=1}^{L_{i-1}} P_i(k)^{f_{i,k}} \big(1 - P_i(k)\big)^{1 - f_{i,k}}, \tag{5.3}$$

or, equivalently, passing to the logarithm, as

$$\sum_{i=2}^{n} \sum_{k=1}^{L_{i-1}} f_{i,k} \ln\big(P_i(k)\big) + (1 - f_{i,k}) \ln\big(1 - P_i(k)\big). \tag{5.4}$$
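As an illustration, the log-likelihood in (5.4) can be evaluated directly and maximized over a grid of values of $\delta$. The sketch assumes the inclusion probability $P_i(k) = \delta/2 + (1-\delta)\sum_{j<i} F_{j,k}/i$ and a left-ordered 0/1 matrix. The grid search is a stand-in for whatever optimizer is actually used, and the small matrix is hypothetical.

```python
import math


def log_likelihood(F, delta):
    """Log-likelihood (5.4) for a left-ordered 0/1 matrix F given as a list of rows."""
    counts = [0] * len(F[0])  # counts[k] = adopters of feature k among earlier nodes
    seen = 0                  # L_{i-1}: number of features observed before node i
    total = 0.0
    for i, row in enumerate(F, start=1):
        for k in range(seen):  # only the "old" features enter (5.4)
            prob = delta / 2 + (1 - delta) * counts[k] / i   # assumed form of P_i(k)
            prob = min(max(prob, 1e-12), 1 - 1e-12)          # guard the logarithms
            total += math.log(prob) if row[k] else math.log(1 - prob)
        for k, value in enumerate(row):
            if value:
                counts[k] += 1
                seen = max(seen, k + 1)
    return total


def estimate_delta(F, grid_size=1001):
    """Return the grid point in [0, 1] with the largest log-likelihood."""
    grid = [j / (grid_size - 1) for j in range(grid_size)]
    return max(grid, key=lambda d: log_likelihood(F, d))


# Hypothetical left-ordered matrix, used only to exercise the functions.
F = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 1, 1],
]
print(estimate_delta(F))
```

With only three rows the estimate is not meaningful; the example shows the mechanics, and a realistic use would pass a matrix with many nodes.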


Now, suppose that we are also allowed to observe the adjacency matrix $A =
(A_{i,j})_{1 \leq i,j \leq n}$ (meaning the final adjacency matrix after the arrival of all the $n$
observed nodes and the formation of all their links) and to know which links
each of the $n$ observed nodes formed only by means of the previously described
first phase (i.e. only due to assortativity). Denote by $A' = (A'_{i,j})_{1 \leq i,j \leq n}$ the
adjacency matrix collecting them. Then, if we decide to model the function $\Phi$ as in
(4.1), we can choose $K$, $\vartheta$, and $p$ in order to fit some properties of the observed
matrices $A'$ and $A$. For instance, if $\ell$ is the number of observed (undirected) links
in matrix $A'$ (i.e. only due to the first phase of network construction) and

$$f = \frac{\text{observed number of linked (in } A'\text{) pairs of nodes with } s^* \text{ features in common}}
{\text{observed number of pairs of nodes with } s^* \text{ features in common}},$$

where $s^*$ is a fixed value that we choose, then we can determine $K > 0$ and $\vartheta \in \mathbb{R}$
by solving (numerically) the following system of two equations:

$$\begin{cases}
\Phi(s^*) = \dfrac{1}{1 + e^{K(\vartheta - s^*)}} = f, \\[2ex]
\displaystyle \sum_{i=2}^{n} \sum_{j=1}^{i-1} \Phi(S_{i,j})
= \sum_{i=2}^{n} \sum_{j=1}^{i-1} \frac{1}{1 + e^{K(\vartheta - s^*) + K\left(s^* - \sum_{k=1}^{L_j} F_{i,k} F_{j,k}\right)}} = \ell.
\end{cases} \tag{5.5}$$

By means of the first equation, we fit the probability that a pair of nodes with $s^*$
features in common establishes a link (during the first phase of network construction),
while, by the second equation, we set the expected number of links in $A'$ equal to the
observed $\ell$. From the first equation, we get the quantity $K(\vartheta - s^*)$; we then substitute
it into the second one in order to obtain $K$, and from this we get $\vartheta$. Note that this
is not a proper estimation procedure, but rather a selection mechanism for $K$ and
$\vartheta$ in order to fit some observed properties of the network. After that, we can
estimate $p$ by means of a maximum likelihood procedure based on the observed matrices.
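The elimination step can be written out explicitly. Assuming $0 < f < 1$ and the sigmoid form of $\Phi$, with $S_{i,j}$ the number of features that $i$ and $j$ have in common and $c$ a shorthand introduced only here:

```latex
\frac{1}{1 + e^{K(\vartheta - s^*)}} = f
\;\Longrightarrow\;
c := K(\vartheta - s^*) = \ln\!\Big(\frac{1-f}{f}\Big),
\qquad
\sum_{i=2}^{n}\sum_{j=1}^{i-1} \frac{1}{1 + e^{\,c + K(s^* - S_{i,j})}} = \ell,
\qquad
\vartheta = s^* + \frac{c}{K}.
```

The middle equation contains $K$ as its only unknown and has to be solved numerically; once $K$ is found, $\vartheta$ follows from the last identity.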

Some important remarks follow. If in the considered situation links are formed
only according to the first phase (i.e. as a result of the assortativity
property), then we can set $p = 0$, as in this case the presence of closed triplets is
only caused by common features and the matrix $A$ coincides with $A'$. Then there is
no obstacle to implementing the previous procedures for detecting all the model
parameters. When we have both phases of network construction (i.e. $p > 0$), the
detection of $K$, $\vartheta$, and $p$ may generate some problems since the available data are
typically $F$ and $A$, while, in order to implement the above procedure, we also need
to observe $A'$. When we cannot observe $A'$, we may try to reconstruct it from $A$ in
some consistent way, if this is possible for the considered application [38]. However,
every empirical criterion used to distinguish between the two different types of links
(the ones due to the first phase and the ones induced by the second phase) obviously
has some degree of arbitrariness, and it can be hard to understand the bias it
implies. An example of this problem can be found in [13] regarding a citation network.
If no suitable criterion is found, we may try to select $K$, $\vartheta$, and $p$ in such a
way that some properties of the adjacency matrix generated by the model are close
to the observed ones. Statistical procedures that integrate out unobserved variables
(in this case, $A'$) or expectation-maximization (EM) algorithms are also possible,
and they will be the subject of future developments. Therefore, although assortativity
and triadic closure are theoretically well-separated concepts, in practice there are
situations in which estimating them separately is not a simple task. However, their
combination is often necessary in order to get models that produce realistic networks.
The simulation of the model with the observed matrix $F$ and $p = 0$ can be useful
as a benchmark.


6 Simulations 


In this section, we present a number of simulations performed following the dynamics
for the selection of features and the creation of links described in Section 3. We simulated
the outcome for feature matrices and for unipartite networks of 1000 nodes, on a
sample of 100 realizations. Regarding the feature-selection dynamics, we analyzed
the resulting feature matrices (constructed as explained above) for different
values of the model parameters $\alpha$, $\beta$, and $\delta$, responsible respectively for the number
of new features per node, the asymptotic behavior of $L_n$ defined in (2.1), and the
phenomenon of preferential attachment in the transmission of the features to new
nodes. After that, we simulated the network construction taking $\Phi$ as in (4.1) and
analyzed its properties for different values of $\delta$, $K$, and $p$, while $\vartheta$ is determined
according to a certain number $\ell$ of (undirected) links due to the first phase of the
unipartite network construction.

Figure 1: An example of feature matrices for $n = 1000$, $\beta = 0.5$, $\delta = 0.1$, and
different values of $\alpha$: 3 (left), 8 (middle), 13 (right). Colored points denote 1 and
white points denote 0.

6.1 Simulations of the feature matrix and estimation of $\alpha$, $\beta$, and $\delta$

As said before, the parameter $\alpha$ is responsible for the number of new features per node:
the larger $\alpha$, the higher the number of new features per node. Concerning this, it is
very important to stress that the parameter $\beta$ also affects the number of features per
node, but the idea is that we first select $\beta$, in order to fit the asymptotic behavior
of $L_n$, and then $\alpha$ in order to fit the number of new features per node.

In the first set of simulations we kept $\beta = 0.5$ and $\delta = 0.1$ fixed and we built the
feature matrix for different values of $\alpha = 3, 8, 13$. In Figure 1 we can see the shapes
of the feature matrices (where colored points denote non-zero values, i.e. 1) for the
three different values of $\alpha$. It is immediately apparent that the main difference among
these matrices concerns the number of features: the total number of features is 185
for $\alpha = 3$, 533 for $\alpha = 8$, and 819 for $\alpha = 13$. Correspondingly, the mean number
of new features per node (averaged over 100 realizations) is about 0.19 for $\alpha = 3$,
0.49 for $\alpha = 8$, and 0.8 for $\alpha = 13$. The mean number of (total) adopted features
per node (averaged over 100 realizations) is about 19.99 for $\alpha = 3$, 52.66 for $\alpha = 8$,
and 79.65 for $\alpha = 13$.

In Figure 2 we show the estimates for the different values of $\alpha$ (with $\beta = 0.5$ and
$\delta = 0.1$ kept fixed).

Parameter $\beta$ controls the asymptotic behavior of $L_n$. For this reason we plotted
$L_n$ as a function of $n$ in a log-log scale; the results are reported in Figure 3. In Figure
3 (a)-(b), we show the estimates for two different values of $\beta$ ($\beta = 0.75$ and $\beta = 1$),
with $\alpha = 3$ and $\delta = 0.1$. In Figure 3 (c)-(d), we show the estimate of $\beta$, for $\beta = 0.5$
and $\beta = 0.75$, but for a different value of $\alpha$ ($\alpha = 10$) in order to underline that $\alpha$
does not affect the power-law behavior of $L_n$ (obviously, the value of the estimate
can be more or less accurate for different values of $\alpha$).

Figure 2: Estimates of $\alpha$ (when $\beta = 0.5$ and $\delta = 0.1$) obtained as the slope of the
regression line in the plot of $L_n$ as a function of $n^{\beta}$. Different values of $\alpha$: 3 (left),
8 (middle), 13 (right) are reported. Panel titles: estimate for $\alpha$ = 3.12 (left),
8.35 (middle), 13.47 (right).


Finally, parameter $\delta$ regulates the phenomenon of preferential attachment: $\delta = 0$
corresponds to the pure preferential attachment case, while $\delta = 1$ corresponds to the pure i.i.d.
case with inclusion probability equal to 1/2. The parameter $\delta$ is estimated through
the maximization of the likelihood function in Equation (5.4). Results for the
estimated parameters are reported in Table 1.

$\delta$         0       0.1     0.2     0.3    0.4    0.5    0.6    0.7    0.8  0.9  1
$\hat\delta$     0.0002  0.1002  0.2002  0.296  0.401  0.495  0.603  0.703  0.8  0.9  1.007

Table 1: Estimates of $\delta$ computed as the maximum point $\hat\delta$ of the likelihood
function in formula (5.4) with $\alpha = 10$ and $\beta = 0.5$.


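In order to assess the accuracy of our estimation procedures, we computed the
Mean Squared Error (MSE) for all three parameters. More precisely, taking a
sample of $R = 100$ realizations, we computed the quantities

$$\mathrm{MSE}_\alpha = \frac{1}{R}\sum_{r=1}^{R} (\hat\alpha_r - \alpha)^2, \qquad
\mathrm{MSE}_\beta = \frac{1}{R}\sum_{r=1}^{R} (\hat\beta_r - \beta)^2, \qquad
\mathrm{MSE}_\delta = \frac{1}{R}\sum_{r=1}^{R} (\hat\delta_r - \delta)^2,$$

where $\alpha$, $\beta$, $\delta$ are the values used to generate all the 100 realizations and $\hat\alpha_r$, $\hat\beta_r$,
$\hat\delta_r$ are the estimated values associated with the realization $r$. For $\alpha = 10$, $\beta = 0.5$,
$\delta = 0.1$, we obtained the following values:

MSE_α = 1.18, MSE_β = 0.0004, MSE_δ = 9 × 10^(…) (exponent illegible in this copy).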

Figure 3: Estimates of $\beta$ obtained as the slope of the regression line in the log-log
plot of $L_n$ as a function of $n$. Different values of $\alpha$ and $\beta$ are reported: $\alpha = 3$, $\beta =
0.75$ (a), $\alpha = 3$, $\beta = 1$ (b), $\alpha = 10$, $\beta = 0.75$ (c), and $\alpha = 10$, $\beta = 0.5$ (d).
Panel titles: estimate for $\beta$ = 0.77 (a), 1.001 (b), 0.76 (c), 0.53 (d).

In Figure 4 we show the shapes of the feature matrices (where colored points
denote non-zero values, i.e. 1) for different values of $\delta = 0.1, 0.5, 0.95$ (two different
values of $\alpha = 3, 8$ and a fixed value of $\beta = 0.5$). Although the number of new
features for each node is comparable for different values of $\delta$ and a fixed value of
$\alpha$ (indeed, the parameter $\delta$ does not affect the number of new features per node,
but only the transmission of the old features to the subsequent nodes), the number
of old features selected by the nodes depends on $\delta$: the closer $\delta$ is to zero, the
more the probability of showing an old feature depends on how many other nodes
selected it (preferential attachment). This fact is highlighted by the “full” vertical
lines, which are concentrated on the left-hand side (because of the preferential attachment
phenomenon, the first features are more successfully transmitted). For greater
values of $\delta$, the matrices become denser and they present a more uniform distribution
of the features among the nodes. The mean number of (total) adopted features per
node for $\alpha = 3$ and $\delta$ equal to 0.1, 0.5, and 0.95 (averaged over 100 realizations) is
about 19.99, 44.24, and 71.49 respectively, while for $\alpha = 8$ and the same values of $\delta$ it
is approximately equal to 52.66, 128.17, and 167.63 respectively.

In order to measure the “uniformity” of the distribution of the features among
nodes, we simply divided the total set of the features into two subsets: $\{1, \dots, \lfloor L_n/2 \rfloor\}$
and $\{\lfloor L_n/2 \rfloor + 1, \dots, L_n\}$. For each feature, we computed the mean number of nodes
that adopted it (i.e. the total number of nodes that adopted the considered feature
divided by the total number of nodes that could have adopted it). Then we
computed the mean value of these numbers over the two subsets and took the difference
between these two values. For different values of $\alpha$ and $\delta$, Table 2 contains the
corresponding values (averaged over 100 realizations) of these differences. Clearly,
the smaller the reported value, the more uniform the distribution of the features
in the matrix. We can notice that for $\delta = 0.1$ and $\delta = 0.5$ the obtained values are
comparable (about 0.10 and 0.11), while for $\delta = 0.95$ we got a very small value.

                 $\delta = 0.1$   $\delta = 0.5$   $\delta = 0.95$
$\alpha = 3$     0.1005           0.1119           0.0099
$\alpha = 8$     0.1010           0.1129           0.0097

Table 2: Measure of the “uniformity” of the feature matrix defined as the
difference (averaged over 100 realizations) between the mean number of nodes per
feature for the first and the second half of the features' set. Considered parameters:
$\alpha = 3, 8$, $\beta = 0.5$ and $\delta = 0.1, 0.5, 0.95$.


6.2 Simulations of the unipartite network and procedure in
order to recover $K$ and $\vartheta$

We performed the simulations of the unipartite network as follows. Once a feature
matrix $F$ is generated, links are created according to the two phases of the link
construction described in Section 3, taking $\Phi$ as in (4.1). We simulated the network
for $n = 1000$ nodes on a sample of 100 realizations.

Figure 4: Examples of feature matrices for $n = 1000$, $\beta = 0.5$, different values
of $\alpha$: 3 (top), 8 (bottom) and different values of $\delta$: 0.1 (left), 0.5 (middle), 0.95
(right). Colored points denote 1 and white points denote 0. Horizontal axis: Features.


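In the first set of experiments, we fixed a number ℓ of links and we determined
the value of ϑ, for different values of K, by solving (numerically) the equation

$$\sum_{i=2}^{n} \sum_{j=1}^{i-1} P(\text{link between } i \text{ and } j \text{ formed in the first phase}) = \ell, \qquad (6.1)$$

where the summand is the first-phase link probability of the pair (i, j), determined by Φ,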

in order to have the expected number of (undirected) links due to the first phase of
the unipartite network construction equal to the given number ℓ. Hence, we studied
the network structure as a function of the parameters K and p (related to the link
formation). In particular, we recall that increasing p strengthens the triadic closure
phenomenon. We also considered different values of δ, which regulates the preferential
attachment in the transmission of the features and so influences the shape of the
feature matrix F. We report the results in the Appendix.
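
As an illustration of how an equation such as (6.1) can be solved numerically, the following sketch uses bisection on ϑ for a fixed K. Everything specific here is an assumption: the logistic function `phi` is only a hypothetical stand-in for the Φ of (4.1), which is not reproduced in this section, and `similarities` is assumed to hold one similarity value per pair (i, j) with j < i. Bisection applies because, for this stand-in, the expected number of links decreases as ϑ grows.

```python
import math


def phi(s, K, theta):
    # HYPOTHETICAL stand-in for the link function of (4.1): a logistic
    # curve in the similarity s, written in a numerically stable form.
    z = K * (s - theta)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def expected_links(similarities, K, theta):
    # Expected number of first-phase links: sum of pair probabilities.
    return sum(phi(s, K, theta) for s in similarities)


def solve_theta(similarities, K, target_links, lo=-50.0, hi=50.0, tol=1e-9):
    """Bisection on theta so that expected_links(...) == target_links."""
    if K <= 0:
        raise ValueError("this sketch assumes K > 0")
    if not (expected_links(similarities, K, hi)
            <= target_links
            <= expected_links(similarities, K, lo)):
        raise ValueError("target_links is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_links(similarities, K, mid) > target_links:
            lo = mid  # too many expected links: theta must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy input: 100 pairs with similarities 0..9, target of 30 expected links.
sims = [s for s in range(10) for _ in range(10)]
theta = solve_theta(sims, K=1.0, target_links=30.0)
```

Any other monotone root finder would serve equally well; bisection is used only because it needs no derivative of the link function.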


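With the second set of experiments, we studied the accuracy of the procedure
(5.5) used in order to recover K and ϑ. Hence, we fixed α = 10, β = 0.5, δ = 0.1,
K = 1, ϑ = 10, and p = 0 (so that A' = A) and we generated a sample of R = 100
realizations of the network. We then applied the procedure (5.5) to each realization
with s* = 10 in order to get the corresponding values K̂_r and ϑ̂_r, for r = 1, ..., R.
The sample means of the estimates are

$$\frac{1}{R}\sum_{r=1}^{R} \widehat{K}_r = 1.000462, \qquad \frac{1}{R}\sum_{r=1}^{R} \widehat{\vartheta}_r = 9.998843,$$

and the accuracy is measured by the mean squared errors

$$\mathrm{MSE}_K = \frac{1}{R}\sum_{r=1}^{R} \bigl(\widehat{K}_r - K\bigr)^2, \qquad \mathrm{MSE}_\vartheta = \frac{1}{R}\sum_{r=1}^{R} \bigl(\widehat{\vartheta}_r - \vartheta\bigr)^2.$$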


7 Application to a co-authorship network 

We downloaded bibliographic information of papers and preprints found in the IEEE
Xplore database [62]. In this dataset a link is taken as the co-authorship of a paper
between two or more authors and the contexts of the papers are given by 2-grams
(pairs of sequential words in the title or abstract). We selected the papers using
search terms related to the specific research area of autonomous cars (also called
connected cars).


7.1 Description of the dataset 

We downloaded (on Aug. 7, 2014) all papers in the IEEE preprint and paper archive
using 17 specific search terms: ‘Lane Departure Warning’, ‘Lane Keeping Assist’,
‘Blindspot Detection’, ‘Rear Collision Warning’, ‘Front Distance Warning’,
‘Autonomous Emergency Braking’, ‘Pedestrian Detection’, ‘Traffic Jam Assist’,
‘Adaptive Cruise Control’, ‘Automatic Lane Change’, ‘Traffic Sign Recognition’,
‘Semi-Autonomous Parking’, ‘Remote Parking’, ‘Driver Distraction Monitor’, ‘V2V or V2I
or V2X’, ‘Co-Operative Driving’, ‘Telematics & Vehicles’, and ‘Night vision’. The
IEEE archive returned all the papers in their database that contain these terms in
the title or abstract, and we downloaded the bibliographic records for all returned
papers including the authors, title, abstract, and the date on which the paper was
added to the database. This download yielded 6 129 distinct papers with a complete
bibliographic record and at least two authors. While these search terms cannot be
expected to yield all papers related to automated car research, we expect to have
found a relatively broad panel of related papers.

7.2 Analysis of the feature-structure 

The feature matrix was built by extracting all 2-grams (pairs of words) appearing in
either the title or abstract of a paper. The text was converted to lowercase, stripped
of all punctuation (with the exception of the ‘/’ and ‘.’ characters) and multi-spaces,
and split into individual sentences. The 2-grams occurring in any sentence in the
title or abstract were labeled as features of the paper. In order to remove spurious
2-grams (e.g. ‘this paper’ often occurs in the abstract, but it is not relevant to
connected cars), we excluded any 2-gram containing any of the words: ‘the’, ‘a’,
‘of’, ‘and’, ‘to’, ‘is’, ‘for’, ‘in’, ‘an’, ‘with’, ‘by’, ‘from’, ‘on’, ‘or’, ‘that’, ‘at’, ‘be’,
‘which’, ‘are’, ‘as’, ‘one’, ‘may’, ‘it’, ‘and/or’, ‘if’, ‘via’, ‘can’, ‘when’, ‘we’, ‘his’,
‘her’, ‘their’, ‘this’, ‘our’, ‘into’, ‘has’, ‘have’, ‘only’, ‘also’, ‘do’, ‘does’, ‘presents’,
‘paper’, ‘doesn’t’, and ‘not’. This approach gave 155 897 distinct 2-grams (features)
for a total of 6 129 papers (nodes). We ordered the papers chronologically based on
their entry date into the IEEE database (which we expect to be a good proxy for
their publication date). The 2-grams were ordered in terms of their first appearance
in a paper (as described earlier).

Footnote (on the choice of s* in Section 6.2): We also considered different values for
s* and obtained similar results.
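
The text-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `two_grams` is hypothetical, `STOP_WORDS` is only an excerpt of the exclusion list quoted above, and the rule used to split sentences (a period followed by whitespace or the end of the text) is an assumption.

```python
import re

# Excerpt of the exclusion list quoted in the text (not the full list).
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "for", "in", "an",
              "with", "this", "presents", "paper", "not"}


def two_grams(text, stop_words=STOP_WORDS):
    """Return the set of 2-grams (pairs of consecutive words within a
    sentence) that contain no excluded word."""
    text = text.lower()
    # Drop punctuation except '/' and '.', then collapse whitespace.
    text = re.sub(r"[^\w\s/.]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    grams = set()
    # Assumption: a sentence ends at a period followed by space or end.
    for sentence in re.split(r"\.(?:\s+|$)", text):
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            if w1 not in stop_words and w2 not in stop_words:
                grams.add((w1, w2))
    return grams


example = two_grams("This paper presents lane detection. "
                    "Lane detection is hard!")
```

Here both sentences contribute the same surviving 2-gram, so `example` contains a single feature; pairs never span a sentence boundary.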


Having extracted the set of the 2-grams contained in each paper, we constructed
the feature-matrix F, with F_{i,k} = 1 if paper i contains the 2-gram k and F_{i,k} = 0
otherwise. The resulting matrix F is shown in Fig. 5(a), with non-zero values of
F indicated by colored points. We also simulated the feature-matrix for a smaller
network of 1000 nodes taking the parameters equal to the corresponding estimated
values (see Fig. 5(b)). The number of features obtained in the simulation is 28 664,
which is consistent with the observed matrix.
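
A small sketch of how such a matrix can be assembled, with columns ordered by first appearance, is given below. The function name `build_feature_matrix` and the dense list-of-lists representation are illustrative assumptions; for a matrix of the size reported here a sparse representation would be needed in practice. Features that first appear in the same paper are ordered alphabetically, which is an arbitrary tie-breaking choice.

```python
def build_feature_matrix(paper_features):
    """paper_features: list of sets of features, papers in chronological
    order. Returns (F, column_of, cumulative), where cumulative[i] is the
    number of distinct features seen up to and including paper i."""
    column_of = {}   # feature -> column index, by first appearance
    rows = []
    cumulative = []
    for features in paper_features:
        # sorted() only fixes the order of features new to this paper.
        for f in sorted(features):
            column_of.setdefault(f, len(column_of))
        rows.append({column_of[f] for f in features})
        cumulative.append(len(column_of))
    n_features = len(column_of)
    F = [[1 if k in row else 0 for k in range(n_features)] for row in rows]
    return F, column_of, cumulative


papers = [{"a", "b"}, {"b", "c"}, {"d"}]
F, column_of, cumulative = build_feature_matrix(papers)
```

The list `cumulative` is the sequence L_1, ..., L_n whose growth is analyzed in the following paragraph.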

The growth of the cumulative count of the distinct 2-grams (the number
of distinct 2-grams seen up to and including each paper, as described earlier)
is shown in Fig. 6(b) in a log-log scale and it shows a clear power-law behavior,
with estimated exponent β̂ = 0.98 (the estimated value of the model parameter β).
Regarding the model parameter α, we get the estimated value α̂ = 32.28, and in
Fig. 6(a) we show the corresponding fit by plotting the cumulative count L_n of the
2-grams as a function of n raised to the power β̂. Finally, the estimated value for the
parameter δ is δ̂ = 0.0057. This last value is very small, and so we can
conclude that the preferential attachment rule in the transmission of the features
plays an important role.


7.3 Analysis of the unipartite network 

Our dataset includes 6 129 papers for a total of 13 581 distinct author names. The
considered unipartite network is constructed taking the papers as nodes and drawing
a link between two nodes if they share at least one author. We harmonized the
author names across different papers by ensuring that the authors’ last names are
always found in the same position and removed any stray punctuation in the names.
No further disambiguation was performed, meaning that authors who use their
full names in some papers but only their initials in other papers are treated as
distinct. For example, the names “J. J. Anaya” and “Jose Javier Anaya” are treated
as distinct authors in our dataset, although it is possible that these distinct names
refer to the same person. A full disambiguation of author names is computationally
difficult [39], and beyond the scope of this paper. This approach gave a unipartite
network with 19 065 links that involve 4 712 nodes in the network. This means that
there is a set of 1 417 isolated nodes, each corresponding to a paper whose two or more
authors are not listed on any other paper in the dataset. However, we decided to also
consider these nodes in our analysis since we included them in the features matrix as
nodes that can potentially link to other nodes.

Figure 5: (a) Feature-matrix associated to the dataset. Dimensions: 6 129 nodes
(papers) x 155 897 features (2-grams). Colored points denote 1 and white points
denote 0. (b) Feature matrix for 1000 nodes, obtained by simulation of the model
with α = α̂ = 32.28, β = β̂ = 0.98, and δ = δ̂ = 0.0057. Colored points denote 1 and
white points denote 0. The total number of features is 28 664, which is consistent
with the observed matrix.

Figure 6: Estimated values of the model parameters α and β (panel titles:
“Estimate for α = 32.28” and “Estimate for β = 0.98”).
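
The construction of the paper network from the author lists can be sketched as below. The function name `coauthor_links` and the input format (one set of harmonized author-name strings per paper) are assumptions made for the illustration; names are compared as plain strings, which mirrors the absence of further disambiguation described above.

```python
from itertools import combinations


def coauthor_links(paper_authors):
    """paper_authors: list of sets of author names, one set per paper.
    Returns the set of undirected links (i, j), i < j, between papers
    that share at least one author name."""
    papers_of = {}
    for paper, authors in enumerate(paper_authors):
        for name in authors:
            papers_of.setdefault(name, []).append(paper)
    links = set()
    for papers in papers_of.values():
        # Papers were appended in increasing order, so i < j here.
        for i, j in combinations(papers, 2):
            links.add((i, j))
    return links


authors = [{"J. J. Anaya", "B"}, {"B", "C"}, {"D", "E"},
           {"Jose Javier Anaya", "C"}]
links = coauthor_links(authors)
isolated = set(range(len(authors))) - {v for link in links for v in link}
```

In this toy input the two spellings of the same name do not create a link between papers 0 and 3, and paper 2 is an isolated node.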


The distribution of the number of 2-grams (the features) in common between two papers
(the nodes), given the presence or the absence of at least one shared author (i.e.
given the presence or the absence of a link between them), is plotted in Figure 7(a).
The curve with (red) triangles is the distribution of the number of 2-grams shared
by two papers given that they have at least one shared author. More precisely, for
each value x on the x-axis, we have on the y-axis the fraction

$$\frac{\text{num. of pairs of papers with } x \text{ 2-grams in common and at least 1 shared author}}{\text{num. of pairs of papers with at least 1 shared author}} \qquad (7.1)$$

The curve with (green) stars represents the distribution of the number of 2-grams
shared by two papers given that they have no authors in common, i.e. it is given by the
same formula as (7.1) but with pairs of papers without shared authors. As we can
see, there is a higher probability of common 2-grams when there are shared authors.

The fraction of pairs of papers with x 2-grams in common that have at least
one shared author is plotted in Figure 7(b) by the curve with (red) triangles. More
precisely, for each value x on the x-axis, we have on the y-axis the fraction

$$\frac{\text{num. of pairs of papers with } x \text{ 2-grams in common and at least 1 shared author}}{\text{num. of pairs of papers with } x \text{ 2-grams in common}} \qquad (7.2)$$

As we can see, the plotted fraction increases with the number of features in common.
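
The two fractions (7.1) and (7.2) can be computed from the feature sets and the link set as in the following sketch. The function name `common_feature_statistics` and the data layout are assumptions; the loop visits every pair of papers, which is quadratic in the number of nodes and is meant only to make the definitions concrete.

```python
from collections import Counter
from itertools import combinations


def common_feature_statistics(features, links):
    """features: list of sets (one per paper); links: set of (i, j), i < j.
    Returns (dist_linked, dist_unlinked, frac_linked), all keyed by the
    number x of common features."""
    linked, unlinked = Counter(), Counter()
    for i, j in combinations(range(len(features)), 2):
        x = len(features[i] & features[j])
        if (i, j) in links:
            linked[x] += 1
        else:
            unlinked[x] += 1
    n_linked = sum(linked.values())
    n_unlinked = sum(unlinked.values())
    # Formula (7.1), for linked pairs and for non-linked pairs.
    dist_linked = {x: c / n_linked for x, c in linked.items()}
    dist_unlinked = {x: c / n_unlinked for x, c in unlinked.items()}
    # Formula (7.2): share of linked pairs among pairs with x in common.
    frac_linked = {x: linked[x] / (linked[x] + unlinked[x])
                   for x in set(linked) | set(unlinked)}
    return dist_linked, dist_unlinked, frac_linked


feats = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"d"}]
dist_linked, dist_unlinked, frac_linked = common_feature_statistics(
    feats, {(0, 1)})
```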


The network is composed of 586 connected components with at least one edge
and 1 417 isolated nodes (a total of 2 003 components). The largest connected
component has 2 776 nodes and 16 108 links, so about 45% of the nodes can reach
each other in the largest connected component, which includes about 84% of
the links. The diameter (i.e. the maximum distance between nodes) of the largest
connected component is 23. The other 585 connected components (disconnected
from the largest component but still having at least one edge) globally contain 1 936
nodes, and over 90% of the components (containing over 75% of the nodes outside
of the largest connected component) contain 7 or fewer nodes. Hence the percentage
of reachable pairs of nodes (denoted by RP in the remainder of the paper) in the
network is about 20.51%.
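
The percentage of reachable pairs follows directly from the component sizes: two nodes are reachable from each other exactly when they lie in the same connected component. The sketch below, with the hypothetical name `reachable_pairs_fraction` and a simple union-find, makes this explicit. As a consistency check on the figures above, the largest component alone contributes 2 776 · 2 775 / 2 = 3 851 700 of the 6 129 · 6 128 / 2 = 18 779 256 pairs, i.e. about 20.5%.

```python
from collections import Counter


def reachable_pairs_fraction(n, links):
    """Fraction of node pairs lying in the same connected component."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    sizes = Counter(find(v) for v in range(n))
    reachable = sum(s * (s - 1) // 2 for s in sizes.values())
    return reachable / (n * (n - 1) // 2)


# Toy network: a path 0-1-2 plus two isolated nodes.
rp = reachable_pairs_fraction(5, {(0, 1), (1, 2)})

# Contribution of the largest component reported in the text.
largest_share = (2776 * 2775 // 2) / (6129 * 6128 // 2)
```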


We decided to first use the model with p = 0 in order to have a benchmark and
then try to guess a good value for p. Taking p = 0, we set A' = A (i.e. links are
only formed by means of the first phase) and we applied the procedure (5.5) to the
observed feature-matrix F with s* = 10 (the corresponding value for f* is 0.725)
and ℓ = 19 065 in order to detect K and ϑ: we found K̂ = 0.8228 and ϑ̂ = 8.8201.
We then generated a sample of 100 realizations of the network by simulating the
model starting from the observed matrix F and with p = 0, K = K̂ = 0.8228, and
ϑ = ϑ̂ = 8.8201. We obtained a network structure very different from the observed
one (for instance, RP = 99%). This can be explained by the fact that we
set p = 0 (benchmark case), while a value of p strictly greater than 0 is to be expected.
Indeed, an author with m ≥ 3 papers automatically guarantees a minimum of
m(m − 1)(m − 2)/6 triangles, since the m papers are pairwise linked. Setting p = 0.7
(see the footnote below) and generating a sample of 100 realizations of the network
by simulating the model starting from the observed matrix F, we succeeded in
capturing a value for RP very near to the observed one, i.e. RP = 19.61% (this value
is an average over the 100 realizations). Moreover, we obtained that the largest
connected component contains on average 2 689.16 nodes, again a value near to
the observed one. Finally, Figure 7(a) contains the distribution of the features in
common between two nodes given the presence (light blue circles) or the absence
(dark blue squares) of a link between them and Figure 7(b) depicts the fraction
of pairs of nodes with x features in common that are linked. Although the curves
related to real data are obviously more irregular, the curves generated by simulations
fit the observed ones well.

Footnote (on the case p = 0.7): In this case we took into account that A' is different
from A, and so the parameters K and ϑ used for the simulations were recovered by
applying the procedure (5.5) to the observed feature-matrix F with a smaller ℓ (that
corresponds to the expected number of links formed during the first phase). We set
ℓ = 4 000 in order to have an averaged total number of links around the observed one.
We found K̂ = 1.019574 and ϑ̂ = 9.047858.

Figure 7: (a) Distribution of the 2-grams (features) in common between two papers
(nodes) given the presence (red triangles for the real data and light blue circles
for the simulated ones) or the absence (green stars for the real data and dark blue
squares for the simulated ones) of at least one co-author. (b) Fraction of pairs of
papers with x 2-grams in common that have at least one co-author, for the real data
case (red triangles) and for the simulated one (light blue circles). Horizontal axis of
both panels: # of common features.


8 Conclusions and discussion on some variants of 
the model 

In this paper, we presented a new network model, where each node is characterized
by a number of features and the probability of a link between two nodes depends
on the number of features and neighbors they share, so that it includes two of the
most observed phenomena in complex systems: assortativity, i.e. the prevalence of
network-links between nodes that are similar to each other in some sense, and triadic
closure, meant as the high probability of having a link between a pair of nodes due to
common neighbors. The bipartite network of nodes and features grows according to
a stochastic dynamics that depends on three parameters respectively regulating the
preferential attachment in the transmission of the features to the nodes, the number
of new features per node, and the power-law behavior of the total number of
observed features. We provide theoretical results and statistical tools for the estimation
of the model parameters involved in the feature-structure dynamics. From the
observation of the feature-matrix, we completely determine the parameters α, β, δ
that regulate its evolution. We provide a procedure for recovering the two parameters,
K and ϑ, of the function Φ, which relates the link probability between two nodes
to their similarity in terms of common features, and the parameter p, which tunes
triadic closure. However, as discussed earlier, for this last point we need to
know which links are formed by assortativity and which by triadic closure,
but often they are not easily distinguishable. Therefore we aim in the future
to evaluate more sophisticated estimation techniques for this issue. Nevertheless, as
shown in Section 7.3, we can still exploit the proposed procedure in order to guess a
good combination of these parameters.

The originality and the merit of our model mainly lie in the double temporal
dynamics (one for the feature-structure and one for the network of nodes), but also
in the attention given to both assortativity and triadic closure mechanisms. We
underline that, differently from other models in the literature, we do not require
the values of some hyperparameters, such as the total number of possible features,
to be specified a priori (avoiding some selection problems discussed in [37]). In the
future, we aim at improving our model in order to make it suitable for other kinds
of networks, e.g. real social networks (such as friendship networks). In particular,
the following variations are possible:

• Normalizing the number of common features: We can vary the model by replacing
the factor F_{i,k} F_{j,k} in formula (3.3), for all pairs (i, j) with 1 ≤ j ≤ i − 1,
with a normalized factor, so that the contribution of a common feature k is smaller
when the number of nodes with k as a feature is larger.

• Weighted bipartite matrices: We can modify the model by replacing in the
inclusion-probability and in the link-probability the binary random number
F_{i,k} by a random weight W_{i,k} of the form

$$W_{i,k} = \frac{F_{i,k}\, Y_{i,k}}{\sum_{h=1}^{L_i} F_{i,h}\, Y_{i,h}},$$

where Y_{i,k} are i.i.d. strictly positive random variables. (By convention, we set
0/0 = 0.) Hence, we have

$$W_{i,k} \in [0,1] \quad\text{and}\quad \sum_{k=1}^{L_i} W_{i,k} = 1,$$

so that W_{i,k} represents the weight percentage given to feature k by node i.
Therefore, the preferential attachment in the inclusion-probability becomes
a “weighted preferential attachment”, in the sense that it depends on the
total weight given to feature k by the previous nodes, and the link-probability
depends on the weights associated to the common features.

• Changeable links: For some real situations, we need to consider also the case
in which the links among nodes can change over time. For instance, we can
combine a link-formation model and a link-dissolution model as in [36]. See
also [2S] for node exit.

• Exit of some features and social influence of links on features: We can modify
the evolution of the feature-structure by accounting for the fact that at each
time step j (after the arrival of the node j) some features can become “obsolete”
and so for such a feature k we will have F_{i,k} = 0 for all i ≥ j + 1. Moreover,
a node could change some features under the influence of its “friends”
(i.e. neighbors) [26]. Hence, we can introduce a sequence of bipartite
matrices such that each of them provides the features before the arrival of node
i + 1, so that in the inclusion-probabilities and in the link-probabilities for
node i + 1, the matrix F is replaced by the corresponding matrix of the sequence.

• Different dynamics for triadic closure: We can change the second phase of our
model by means of different policies for the selection of additional neighbors
of a node i among the neighbors of i’s neighbors. Indeed, in this paper we
consider a binomial model according to which each common neighbor of a pair
(i, j) of not-linked nodes gives, independently of the others, a probability p of
inducing a link between i and j. A possible alternative is that, with probability
p, an additional link for a certain node is formed by the selection (uniformly
at random) of a node among the neighbors of its neighbors (e.g. [TO]).
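
To make the binomial closure rule of the last item concrete: if a pair of not-linked nodes has c common neighbors and each of them independently induces the link with probability p, then the link is formed with probability 1 − (1 − p)^c. The sketch below illustrates this; the function names are hypothetical and only the rule itself is taken from the text.

```python
import random


def closure_probability(c, p):
    """Probability that at least one of c common neighbors, each acting
    independently with probability p, induces the link."""
    return 1.0 - (1.0 - p) ** c


def sample_closure(c, p, rng):
    """Draw one outcome of the rule: True if the link is induced."""
    return any(rng.random() < p for _ in range(c))


rng = random.Random(0)
p_three = closure_probability(3, 0.5)
never = sample_closure(0, 0.9, rng)   # no common neighbor, no closure
```

With p = 0.5, three common neighbors already give a closure probability of 0.875, which is consistent with the strong growth of the number of links with p reported in the Appendix.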
A Appendix

A.1 Proof of the asymptotic behavior of L_n

Theorem A.1. Consider our model. The following statements hold true:

a) L_n / ln(n) → α a.s. for β = 0;

b) L_n / n^β → α/β a.s. for β ∈ (0, 1].
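Proof. Set λ_i = α / i^{1−β} and recall that the random variables N_i are independent
and each N_i has distribution Poi(λ_i).

The assertion b) is trivial for β = 1 since, in this case, L_n is the sum of n
independent random variables with distribution Poi(α) and so, by the classical strong
law of large numbers, L_n/n → α a.s.

Now, let us prove assertions a) and b) for β ∈ [0, 1). Define

$$\lambda(\beta) = \alpha \ \text{ if } \beta = 0 \quad\text{and}\quad \lambda(\beta) = \frac{\alpha}{\beta} \ \text{ if } \beta \in (0,1),$$

$$a_n(\beta) = \ln n \ \text{ if } \beta = 0 \quad\text{and}\quad a_n(\beta) = n^{\beta} \ \text{ if } \beta \in (0,1).$$

We need to prove that L_n / a_n(β) → λ(β) a.s. First, we observe that

$$\frac{\sum_{i=1}^{n} E[N_i]}{a_n(\beta)} = \frac{\sum_{i=1}^{n} \lambda_i}{a_n(\beta)} \longrightarrow \lambda(\beta).$$

Next, let us define

$$T_0 = 0 \quad\text{and}\quad T_n = \sum_{i=1}^{n} \frac{N_i - E[N_i]}{a_i(\beta)} = \sum_{i=1}^{n} \frac{N_i - \lambda_i}{a_i(\beta)}.$$

Then (T_n) is a martingale with

$$E[T_n^2] = \sum_{i=1}^{n} \frac{E[(N_i - \lambda_i)^2]}{a_i(\beta)^2} = \sum_{i=1}^{n} \frac{\lambda_i}{a_i(\beta)^2},$$

and so

$$\sup_n E[T_n^2] = \sum_{i=1}^{+\infty} \frac{\lambda_i}{a_i(\beta)^2} < +\infty.$$

Thus, (T_n) converges a.s. and Kronecker’s lemma implies

$$\frac{1}{a_n(\beta)} \sum_{i=1}^{n} a_i(\beta)\, \frac{N_i - \lambda_i}{a_i(\beta)} \xrightarrow{a.s.} 0,$$

that is

$$\frac{\sum_{i=1}^{n} N_i}{a_n(\beta)} - \frac{\sum_{i=1}^{n} \lambda_i}{a_n(\beta)} \xrightarrow{a.s.} 0.$$

Therefore, we can conclude that

$$\lim_n \frac{L_n}{a_n(\beta)} = \lim_n \frac{\sum_{i=1}^{n} N_i}{a_n(\beta)} = \lim_n \frac{\sum_{i=1}^{n} \lambda_i}{a_n(\beta)} = \lambda(\beta) \quad \text{a.s.}$$

□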


Remark A.2. The above Theorem implies that ln(L_n)/ln(n) is a strongly consistent
estimator of β. Indeed, if β = 0, then L_n ∼ α ln(n) a.s. as n → +∞; hence
ln(L_n) ∼ ln(α) + ln(ln(n)) a.s., and therefore
ln(L_n)/ln(n) ∼ ln(α)/ln(n) + ln(ln(n))/ln(n) → 0 = β a.s.
Furthermore, if β > 0, then we have L_n ∼ (α/β) n^β a.s. as n → +∞, so
ln(L_n) ∼ ln(α/β) + β ln(n) a.s., hence
ln(L_n)/ln(n) ∼ ln(α/β)/ln(n) + β → β a.s.
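
A quick simulation can illustrate the Theorem and the Remark. The sketch below draws N_i ∼ Poi(α / i^{1−β}), sums them to obtain L_n, and evaluates ln(L_n)/ln(n). The function names and the simple Poisson sampler (Knuth's multiplication method, adequate only for small rates) are choices made for this illustration. Note that, by the Remark, the estimator differs from β by a term of order ln(α/β)/ln(n), so for moderate n it can still be far from β even though L_n is already close to (α/β) n^β.

```python
import math
import random


def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_feature_count(n, alpha, beta, rng):
    """L_n = N_1 + ... + N_n with N_i ~ Poi(alpha / i**(1 - beta))."""
    return sum(poisson(alpha / i ** (1.0 - beta), rng)
               for i in range(1, n + 1))


def beta_estimate(L_n, n):
    """The estimator ln(L_n) / ln(n) of Remark A.2."""
    return math.log(L_n) / math.log(n)


rng = random.Random(0)
n, alpha, beta = 2000, 10.0, 0.5
L_n = simulate_feature_count(n, alpha, beta, rng)
limit_value = (alpha / beta) * n ** beta   # asymptotic value of L_n
estimate = beta_estimate(L_n, n)
```

Comparing `L_n` with `limit_value` checks statement b) of the Theorem, while `estimate` shows how slowly the logarithmic estimator approaches β.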
A.2 Simulations of the unipartite network: some analysis
on its structure

We generated feature matrices with n = 1 000 nodes taking fixed values for α and
β, i.e. α = 10 and β = 0.5, and different values for δ (δ ∈ [0.1, 0.5]). Starting
from these feature matrices, we considered the structure of the unipartite network
for three different values of K (K = 1, 4, 10) and three different values of p
(p = 0, 0.1, 0.5).


We considered the following quantities:

• the clustering coefficient C, defined as

$$C = \frac{\text{Number of closed triplets}}{\text{Number of connected triplets of nodes}}, \qquad (A.1)$$

where a connected triplet is a set of three nodes that are connected by two or
three undirected links (open and closed triplet, respectively). See Table 3.

• the fraction of pairs of nodes at distance at most 20, i.e. the fraction of pairs
of nodes that are reachable from each other within at most 20 steps (see Table 4):

$$RP_{20} = \frac{\text{Number of couples of nodes at distance at most 20}}{\text{Number of couples of nodes}}. \qquad (A.2)$$

We recorded also the observed maximum value h* of the distance between the
nodes.

• the degree distribution, in the sense of the Complementary Cumulative Distribution
Function (CCDF) of the number of neighbors of each node (see Figure 8).
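
The first two quantities can be computed as in the following sketch. The function names are hypothetical. For (A.1) the code uses the usual convention in which triplets are counted per central node, so that each triangle contributes three closed triplets; this convention is an assumption, since the text does not spell out how triplets are counted. For (A.2) a breadth-first search truncated at depth h is run from every node, and the largest distance found among the reachable pairs is returned together with the fraction.

```python
from collections import deque


def adjacency(n, links):
    adj = [set() for _ in range(n)]
    for i, j in links:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def clustering_coefficient(adj):
    """Closed triplets / connected triplets, counted per central node."""
    closed = connected = 0
    for nbrs in adj:
        nb = sorted(nbrs)
        connected += len(nb) * (len(nb) - 1) // 2
        for idx, a in enumerate(nb):
            closed += sum(1 for b in nb[idx + 1:] if b in adj[a])
    return closed / connected if connected else 0.0


def reachable_within(adj, h):
    """Return (fraction of pairs at distance <= h, largest such distance)."""
    n = len(adj)
    pairs, max_dist = 0, 0
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] == h:
                continue  # do not explore beyond depth h
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, d in dist.items():
            if t > s:  # count each unordered pair once
                pairs += 1
                max_dist = max(max_dist, d)
    return pairs / (n * (n - 1) // 2), max_dist


# Toy network: a triangle 0-1-2, a pendant node 3, an isolated node 4.
adj = adjacency(5, {(0, 1), (0, 2), (1, 2), (2, 3)})
C = clustering_coefficient(adj)
rp20 = reachable_within(adj, 20)
```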

The clustering coefficient C strongly increases with p (as expected). For p = 0
the percentage of closed triplets increases with δ, but remains smaller than or equal to
13% of total triplets for all considered values of δ and K. For values of p greater
than zero, the percentage of closed triplets increases with δ in a range of 13% to 30%
for p = 0.1 and in a range of 39% to 62% for p = 0.5. The effect of K and δ seems
to be marginal on the clustering coefficient.


                     δ = 0.1   0.15   0.2    0.25   0.3    0.35   0.4    0.45   0.5
K = 1    p = 0       0.04      0.05   0.07   0.08   0.08   0.10   0.13   0.13   0.10
         p = 0.1     0.13      0.17   0.20   0.23   0.23   0.24   0.26   0.27   0.30
         p = 0.5     0.39      0.45   0.45   0.49   0.49   0.47   0.49   0.53   0.62
K = 4    p = 0       0.06      0.06   0.08   0.09   0.08   0.11   0.13   0.13   0.11
         p = 0.1     0.15      0.18   0.21   0.24   0.23   0.25   0.26   0.28   0.30
         p = 0.5     0.42      0.47   0.46   0.49   0.49   0.48   0.50   0.53   0.62
K = 10   p = 0       0.06      0.06   0.08   0.09   0.08   0.11   0.13   0.14   0.11
         p = 0.1     0.15      0.18   0.21   0.24   0.23   0.25   0.26   0.28   0.30
         p = 0.5     0.42      0.47   0.46   0.49   0.49   0.48   0.49   0.53   0.62

Table 3: Clustering coefficient (averaged over 100 realizations) for α = 10, β = 0.5,
ℓ = 4000, and different values of δ, K, and p.

Looking at the values obtained for the fraction of pairs of nodes at distance at
most 20, for the two different values δ = 0.1 and δ = 0.5, we can notice a clear
difference in the behavior (independently of K and p): indeed, the fraction of reachable
pairs for δ = 0.1 (when K and p are fixed) is much greater than the corresponding
fraction for δ = 0.5. Moreover, the fraction of reachable pairs decreases when K
increases (and the other parameters are fixed) and slightly changes when only p
varies. The complementary fraction corresponds to the pairs of nodes at distance
greater than 20 or not reachable from each other.

The observed maximum distance h* (among pairs of nodes at distance at most
20) varies in the range from 2 to 5 and decreases when δ (p and K, respectively)
increases and the other parameters are fixed.



                 K = 1                   K = 4                   K = 10
           δ = 0.1     δ = 0.5     δ = 0.1     δ = 0.5     δ = 0.1     δ = 0.5
p = 0      0.439 (5)   0.128 (4)   0.350 (4)   0.118 (4)   0.349 (4)   0.117 (4)
p = 0.1    0.438 (4)   0.128 (3)   0.352 (3)   0.118 (3)   0.350 (3)   0.117 (3)
p = 0.5    0.437 (3)   0.128 (2)   0.351 (2)   0.118 (2)   0.349 (2)   0.117 (2)

Table 4: Fraction of pairs of nodes at distance at most 20 (averaged over 100
realizations) for α = 10, β = 0.5, ℓ = 4000, and different values of δ, K, and p. For each
set of parameters, the corresponding observed maximum distance h* is reported in
brackets.


Finally, the effect of p on the total number of links is clear: when p = 0 the
number of links is approximately equal to the chosen ℓ (i.e. ℓ = 4000), since in this
case we have only the first phase of the unipartite network construction: links are
related only to the features. The larger p, the more triangles are closed and so the
more links we have. Table 5 reports the total number of links for all combinations
of the parameters. Regarding the degree distribution, Figure 8 shows the CCDF of
the number of neighbors of a node. Parameter p also influences the shape of the
degree distribution, together with δ and K.
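
For reference, an empirical CCDF of the degrees can be obtained as in the sketch below. The function name `degree_ccdf` is hypothetical, `links` is assumed to be a set of distinct undirected pairs, and the CCDF is taken here as the fraction of nodes with degree greater than or equal to d, which is one of the two common conventions (the other uses a strict inequality).

```python
from collections import Counter


def degree_ccdf(n, links):
    """Map each observed degree d to the fraction of nodes with degree >= d."""
    degree = [0] * n
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    counts = Counter(degree)
    ccdf = {}
    remaining = n  # number of nodes with degree >= current d
    for d in sorted(counts):
        ccdf[d] = remaining / n
        remaining -= counts[d]
    return ccdf


# Star with center 0 and three leaves.
ccdf = degree_ccdf(4, {(0, 1), (0, 2), (0, 3)})
```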
                 K = 1                     K = 4                     K = 10
           δ = 0.1      δ = 0.5      δ = 0.1      δ = 0.5      δ = 0.1      δ = 0.5
p = 0       4 003.47     3 998.15     4 002.17     3 999.59     3 997.13     3 999.52
p = 0.1    17 853.46    19 862.54    19 107.53    19 523.42    19 112.46    19 484.86
p = 0.5    93 093.05    43 538.68    81 343.97    41 382.62    81 039.49    41 156.34

Table 5: Total number of links in the unipartite network (averaged over 100
realizations) for α = 10, β = 0.5, ℓ = 4000, and δ, K, and p varying. Note that for p = 0
the number is around the chosen ℓ = 4000.


Figure 8: CCDF of the number of neighbors (averaged over 100 realizations) for
α = 10, β = 0.5, ℓ = 4000, and different values of K (corresponding to different
boxes) and different values of δ and p (corresponding to different symbols and colors).
Horizontal axis of each box: # of friends.